Group Members

  • Ali Altıntaş
  • Atacan Bayseferoğulları
  • Eren Melih Altun
  • Taylan Polat

Introduction and Data Overview

The main goal of this project is to understand the effectiveness of the marketing campaigns based on the customers’ coupon redemption by evaluating the relationship between their demographic information and transactions. The related data set was acquired from the link below: data

With regards to the data set; there are 7 tables including “Campaign Data, Item Data, Train, Final Train, Coupon Item Mapping, Customer Demographics, and Customer Transaction Data” but mainly it contains following information regarding the customers.

  • User Demographic Details
  • Campaign and Coupon Details
  • Product details
  • Previous transactions

Below, you may find necessary libraries for R codes (You can click to Code button at the right to unhide codes):

pti <- c("sqldf","dplyr","tidyverse","plotly","ggplot2","ggpubr","kableExtra","lubridate")
pti <- pti[!(pti %in% installed.packages())]
if(length(pti)>0){
    install.packages(pti)
}

library(sqldf)
library(dplyr)
library(tidyverse)
library(plotly)
library(ggplot2)
library(ggpubr)
library(kableExtra)
library(lubridate)

Below code chunk downloads data from github and makes necessary adjustments by cleaning and converting date columns.

githubURL <- "https://github.com/erenaltunn/Data/blob/main/BDA_Project_Data.RData?raw=true"
load(url(githubURL))

customer_transaction_data$date<-as.Date(customer_transaction_data$date, format = "%Y-%m-%d")

campaign_data$start_date<-as.Date(sub("(.{6})(.*)", "\\120\\2", campaign_data$start_date), format = "%d/%m/%Y")
campaign_data$end_date<-as.Date(sub("(.{6})(.*)", "\\120\\2", campaign_data$end_date), format = "%d/%m/%Y")

Data Analysis

Product Category and Product Based Analysis

Analysis of the products with the highest number of transactions was carried out. In order to perform this analysis, the customer_transaction_data table and the item_table have been joined. With the joining process, the categories of the products were obtained from the item_table and the total amount, total turnover and average discount amount data on category basis were obtained.

As a result of the study, the categories table with the highest total quantity, the highest turnover a month and highest average discount rate, respectively, is given below.

join_item_tran <- sqldf("
               SELECT aa.*, bb.brand, bb.brand_type, bb.category 
               FROM
                    (customer_transaction_data) aa
               LEFT JOIN
                    (item_data) bb ON  aa.item_id = bb.item_id")

Below table is ordered by quantity column and returns the best 10 category by the quantities sold. Because of the nature of Fuel, it has the highest total quantity as it is sold in liters. In these days, people are getting to be more conscious and prefers packaged meat than regular meat. Also, Natural Products category found a place here which means people care about their health.

kbl(sqldf("SELECT Category AS Category,sum(quantity)as Total_Quantity
      FROM join_item_tran
          GROUP BY category
          ORDER BY sum(quantity) DESC
          LIMIT 10"),caption="Quantities Sold by Category")%>%
   kable_classic(full_width = F, html_font = "Cambria")
Quantities Sold by Category
Category Total_Quantity
Fuel 149824927
Miscellaneous 21543590
Grocery 1244334
Pharmaceutical 189774
Packaged Meat 86562
Natural Products 66454
Meat 30585
Dairy, Juices & Snacks 27552
Bakery 25075
Prepared Food 14130

Below table is ordered by total turnover column. As we can see, there are lower average other discount (these discounts are not associated with personal campaigns) with higher turnover value.

kbl(sqldf("SELECT Category AS Category, sum(selling_price) as Total_Turnover,
 ROUND((AVG(other_discount)*-1),1) as Average_Other_Discount
      FROM join_item_tran
          GROUP BY category
          ORDER BY sum(selling_price) DESC
          LIMIT 10"),caption="Total Turnover by Category")%>%
   kable_classic(full_width = F, html_font = "Cambria")
Total Turnover by Category
Category Total_Turnover Average_Other_Discount
Grocery 86229623 17.5
Pharmaceutical 22871322 11.7
Fuel 12838240 30.6
Packaged Meat 8891032 35.5
Meat 4669631 39.8
Natural Products 4604165 7.6
Dairy, Juices & Snacks 2191815 12.7
Miscellaneous 2169650 21.5
Bakery 2148925 10.0
Prepared Food 1911024 10.5

Below table is ordered by average other discount column. Seafood, Meat and Package Meat have the highest discount values. As these kind of foods are decaying very quick, stores tend to make more discount on them.

kbl(
sqldf("
 SELECT Category AS Category, 
 ROUND((AVG(other_discount)*-1),1) as Average_Other_Discount
 FROM join_item_tran
          GROUP BY category
          ORDER BY avg(other_discount)
          LIMIT 10"),caption="Average Discount by Category")%>%
   kable_classic(full_width = F, html_font = "Cambria")
Average Discount by Category
Category Average_Other_Discount
Seafood 51.1
Meat 39.8
Packaged Meat 35.5
Fuel 30.6
Garden 30.3
Skin & Hair Care 29.5
Miscellaneous 21.5
Grocery 17.5
Dairy, Juices & Snacks 12.7
Pharmaceutical 11.7

Unit Price Analysis

A study was carried out to see the historical development of the prices of products sold in high quantity.

In this context, first of all, prices per unit were provided on a daily basis. Later, 9 products with the highest sales numbers in total were identified. Finally, these two tables were joined and the daily sales prices of the 9 products sold in the highest sum were reached.

In the first table below we can see the daily price changes of the top 9 products.

daily_unit_price <- sqldf(
"SELECT aa.*
 
 FROM
   (SELECT 
   date AS date, 
   item_id, 
   AVG(selling_price/quantity) as daily_unit_price
   FROM join_item_tran
   GROUP BY date, item_id) aa
       INNER JOIN
   (SELECT item_id, SUM(quantity)
   FROM join_item_tran
     GROUP BY item_id
     ORDER BY SUM(quantity) desc
     LIMIT 9) bb
       ON aa.item_id = bb.item_id")

#daily_unit_price$date <- as.Date(daily_unit_price$date, format="%Y-%m-%d")

ggplot(daily_unit_price, aes(x=date, y=daily_unit_price)) +
  geom_line(aes(color="#1A76FF")) + 
   facet_wrap(~ item_id) + 
   geom_smooth(color="#37536d", size=0.25)+
   theme_minimal()+ theme(legend.position = "none")+
   theme(axis.text.x = element_text(angle = 45))+ 
   labs(title = "Time Series Plot for Daily Unit Price")+
      xlab("Date")+
      ylab("Daily Unit Price")

However, when the relevant table is examined, we see that there are no sales for some products on certain dates. Therefore, in order to see the full changes in prices, we wanted to select 3 products with full series and examine the price development for these 3 products. Item ID 48973 and 49004 are in the category of Local Miscellaneous, Item ID 49009 is in the category of Fuel.

As a result of our investigations, prices for three products reached their peak in July 2012. Prices then drop rapidly and enter a steady upward trend until July of the following year.

daily_unit_price%>%
   filter(item_id == "48973" |  item_id=="49004" | item_id=="49009") %>%
      ggplot( aes(x=date, y=daily_unit_price)) +
      geom_line(aes(color="#1A76FF")) + 
      facet_grid(item_id~. ) + 
      geom_smooth(color="#37536d", size=0.25)+
      theme_bw()+ theme(legend.position = "none")+ 
   labs(title = "Time Series Plot for Daily Unit Price")+
      xlab("Date")+
      ylab("Daily Unit Price")

Brand Type and Category Analysis

When we look at 4 different sectors in local brand, It can be confidently state that the highest average endorsement belongs to Pharmaceutical field whereas the fewest pertains to local grocery brands. On the Established side, Meat sector is seemed that having the highest turnover chance as if pharmaceutical outliers are ignored.

df<- sqldf("
 SELECT 
   id.brand,
   id.brand_type,
   id.category,
   SUM(ct.quantity) AS total_qty,
   SUM(ct.selling_price)/SUM(quantity) as avg_turnover,
   (SUM(-ct.other_discount)+SUM(-ct.coupon_discount))/SUM(quantity) as avg_total_discount
 FROM 
   customer_transaction_data ct 
   LEFT JOIN item_data id USING(item_id)
 GROUP BY brand,category,brand_type")

y <- list(title = "Average Turnover",
  titlefont = F)

x <- list(title = "Brand Type",
  titlefont = F)

df <- df %>% filter (category=='Grocery' | category =='Meat' | category=='Pharmaceutical' | category=='Seafood' )%>%filter(avg_total_discount>0)
fig <- plot_ly(df, x = ~brand_type, y = ~log(avg_turnover), color = ~category, type = "box")
fig <- fig %>% layout(boxmode = "group",xaxis=x,yaxis=y)

fig

There are two types of brand types namely, “established” and “local” brands. The most 4 interesting brand category field which are Grocery, Meat, Pharmaceutical and Seafood industries are filtered.

While there are no significant differences among discount quantity distribution of categories in established brands, more vertical gaps are occurred among the figures for Local brands. Additionally, due to the insufficient data quantities in Meat and Seafood of Local brands, they have appeared with narrow variances figure in the graph.

Local brands in seafood industries are ensuring highest amount of discount in the graph. This situation can be explained by wealthiness of countries seafood resources lead to reduce cost in certain seasons.On the other hand, Local brands in meat sector could not provide vast amount of opportunity as much as the amount ensured by established brands in meat sector.

df<- sqldf("
 SELECT 
   id.brand,
   id.brand_type,
   id.category,
   SUM(ct.quantity) AS total_qty,
   SUM(ct.selling_price)/SUM(quantity) as avg_selling_turnover,
   (SUM(-ct.other_discount)+SUM(-ct.coupon_discount))/SUM(quantity) as avg_total_discount
 FROM 
   customer_transaction_data ct 
 LEFT JOIN 
   item_data id USING(item_id)
 GROUP BY 
   brand,category,brand_type ")


y <- list(title = "Average Discount",
  titlefont = F)

x <- list(title = "Brand Type",
  titlefont = F)

df <- df %>% filter (category=='Grocery' | category =='Meat' | category=='Pharmaceutical' | category=='Seafood' )%>%filter(avg_total_discount>0)
fig <- plot_ly(df, x = ~brand_type, y = ~log(avg_total_discount), color = ~category, type = "box")
fig <- fig %>% layout(boxmode = "group",xaxis=x,yaxis=y)

fig

Average Order Value by Age and Marriage Status

When we examine the average order value that customers pay for the products they buy, we observed that the maximum paying group is 70+. It was observed that 70+ age group pay an average of 427 USD for married ones, while single ones pay an average of 630 USD. It was observed that the group with the next highest payment is 56-70 group. For this reason we can conclude that people can pay more in retirement age. When each age group is examined, it is seen that married people can pay more, while this situation is reversed in the 70+ group. We can say that for this reason, there are more people who remain single due to deaths.

final_train$age_range <- case_when(final_train$age_range == "0" ~"18-25",
                                   final_train$age_range == "1" ~"26-35",
                                   final_train$age_range == "2" ~"36-45",
                                   final_train$age_range == "3" ~"46-55",
                                   final_train$age_range == "4" ~"56-70",
                                   TRUE ~"70+")

final_train$marital_status <- case_when(final_train$marital_status == "0"~"Married",TRUE ~ "Single")
customer_demographics$marital_status <- case_when(customer_demographics$marital_status == ""~"Married",TRUE ~ customer_demographics$marital_status)

cus_inf<-sqldf(
"SELECT 
   cd.age_range,
   cd.marital_status,
   AVG(ft.price_sum/ft.customer_id_count) AS pprice_sum 
 FROM customer_demographics cd
   LEFT JOIN 
      final_train ft ON ft.customer_id = cd.customer_id
 GROUP BY 
   cd.age_range,cd.marital_status
 ORDER BY 
   AVG(ft.price_sum) DESC")


mar <- cus_inf %>%
  filter(marital_status == "Married")

sing <- cus_inf %>%
  filter(marital_status == "Single")

married_psum <- round(mar$pprice_sum,2)
single_psum <- round(sing$pprice_sum,2)
age_rang <- mar$age_range
data <- data.frame(age_rang,married_psum,single_psum)

fig <- plot_ly(data, x = age_rang, y = married_psum, type = 'bar',name = 'Married',
               marker = list(color = 'rgb(55, 83, 109)'))
fig <- fig %>% add_trace(y = ~single_psum, name = 'Single',marker = list(color = 'rgb(26, 118, 255)'))
fig <- fig %>% layout(title = 'Average Basket Value by Age and Marriage Status',
                      xaxis = list(
                        title = "Age Groups",
                        tickfont = list(
                          size = 14,
                          color = 'rgb(107, 107, 107)')),
                      yaxis = list(
                        title = 'Average Order Value in USD',
                        titlefont = list(
                          size = 16,
                          color = 'rgb(107, 107, 107)'),
                        tickfont = list(
                          size = 14,
                          color = 'rgb(107, 107, 107)')),
                      legend = list(x = 0, y = 0.9, bgcolor = 'rgba(255, 255, 255, 0)', bordercolor = 'rgba(245, 246, 249, 1)'),
                      barmode = 'group', bargap = 0.15, bargroupgap = 0.1)

fig

Campaign and Customer Segmentation Analysis

Analyzing campaigns and customer base is crucial to marketing department, because generating revenue by marketing campaigns have the biggest share in most of the companies’ financial statements. Regarding to this, we made a basic campaign analysis and customer segmentation to drive targeting with the given data set.

Campaign Analysis

Below table shows us the return rates for each Coupon ID. Coupon ID 586 has the biggest return rate and it is associated with Grocery products. As grocery is an essential product group for living, it is understandable to be at the top of this list.

kbl(sqldf("SELECT 
   coupon_id AS 'Coupon ID',
   ROUND(AVG(redemption_status),3) AS 'Return Rate'
   
 FROM final_train
 GROUP BY coupon_id
 ORDER BY 2 desc
 LIMIT 10"))%>%
   kable_classic(full_width = F, html_font = "Cambria")
Coupon ID Return Rate
586 0.142
754 0.119
661 0.118
9 0.117
21 0.103
960 0.093
22 0.090
786 0.086
6 0.086
671 0.083

Below code chunk, creates two tables with customers which have average order value above 500 USD and below 500 USD. It shows their average coupon counts. It seems like higher average basket leads to higher coupon counts.

kbl(sqldf("WITH basket AS(
 SELECT customer_id,SUM(selling_price)/COUNT(DISTINCT Date) AS AOV
   FROM customer_transaction_data
   GROUP BY customer_id),

 coupon_cnt AS(
 
 SELECT 
   customer_id, 
   COUNT(coupon_id) AS Coupon_Count
 
 FROM 
   final_train
 GROUP BY
   customer_id
 )
 
 SELECT  'Less than 500' AS Category,ROUND(AVG(b.Coupon_Count),0) AS 'Average Count'
 FROM basket a 
   INNER JOIN coupon_cnt b ON a.customer_id=b.customer_id
 WHERE a.AOV<500
 
 UNION
 
 SELECT  'Greater than 500' AS Category,ROUND(AVG(b.Coupon_Count),0) AS 'Average Count'
 FROM basket a 
   INNER JOIN coupon_cnt b ON a.customer_id=b.customer_id
 WHERE a.AOV>500"
 ))%>%
   kable_classic(full_width = F, html_font = "Cambria")
Category Average Count
Greater than 500 56
Less than 500 46

RFM Customer Segmentation Analysis

RFM analysis is a marketing technique that allows us to explore different customer profiles in our customer base. RFM stands for Recency, Frequency and Monetary. In this analysis, we wanted to add campaign usage (RFMC) to consider customers who are responsive to marketing campaigns. In the first query, We are calculating RFMC values and setting Segments. Calculation method follows below logic:

  • Recency: Number of days past from the last purchase day of each customer.
  • Frequency: Number of transactions of each customer.
  • Monetary: Average Order Value of each customer.
  • Campaign Usage: Number of coupons used by each customer.

Let us explain RFM dataframe’s common table expressions one by one:

  • RFM cte calculates Recency, Frequency and Monetary values from customer transaction table.
  • coupon_cnt cte calculates the total coupon redemption for each customer.
  • aggregated cte joins RFM and coupon_cnt and calculates average order value for further segmentation.
  • scored cte creates segments based on the quartiles of each metric.
  • Final query concatenates RFMC scores and creates general segments.

RFM_Table dataframe’s query uses case when expressions to create specific segments for our customer base.

RFM<-sqldf(
"WITH RFM AS(
 
 SELECT 
   customer_id,
   ((SELECT MAX(date) as max_dt FROM customer_transaction_data)-MAX(date)) AS Recency,
   COUNT(DISTINCT date) AS Frequency,
   SUM(selling_price) AS Monetary
   FROM customer_transaction_data
   GROUP BY customer_id
 ),
 
 coupon_cnt AS(
 
 SELECT 
   customer_id, 
   COUNT(coupon_id) AS Coupon_Count
 
 FROM 
   final_train
 GROUP BY
   customer_id
 ),
 
 aggregated AS(
 SELECT 
   a.customer_id AS Customer_ID, 
   a.Recency, 
   a.Frequency, 
   ROUND(a.Monetary/a.Frequency,1) AS AOV, 
   b.Coupon_Count
 
 FROM 
   RFM a 
 INNER JOIN 
  coupon_cnt b ON a.customer_id=b.customer_id
 GROUP BY 
   a.customer_id),
   
 scored AS(  
 SELECT
   Customer_ID,
   Recency,Frequency,AOV, Coupon_Count,
   ntile(4) over (order by Recency desc) as Recency_Score,
   ntile(4) over (order by Frequency) as Frequency_Score,
   ntile(4) over (order by AOV) as Monetary_Score,
   ntile(4) over (order by Coupon_Count) as Cmp_Count_Score
 FROM 
   aggregated)
 
 SELECT *, Recency_Score||Frequency_Score||Monetary_Score||Cmp_Count_Score AS Segment
 
 FROM scored")

kbl(head(RFM))%>%
   kable_classic(full_width = F, html_font = "Cambria")
Customer_ID Recency Frequency AOV Coupon_Count Recency_Score Frequency_Score Monetary_Score Cmp_Count_Score Segment
1283 429 28 1071.5 14 1 1 3 1 1131
759 380 11 1638.7 22 1 1 4 1 1141
975 364 17 1183.6 18 1 1 3 1 1131
1132 346 30 905.8 19 1 1 2 1 1121
1112 265 72 949.2 21 1 2 2 1 1221
1261 233 30 1753.3 27 1 1 4 1 1141
RFM_Table<-sqldf(
"SELECT 
   *, 
   CASE 
   WHEN Segment IN ('4444','4344','4434','4433','4443','3444','3443','4442')
     THEN 'Star'
   
   WHEN Segment IN ('3442','3441','3431','3432','3443','3444','3433','3434','4431','4432','4433','4434',
   '4341','4342','4343','4344','3333','3331','3332','3334','3341','3342','3343','3344','4331','4332','4333','4334')
     THEN 'High Value'
   
   WHEN Segment IN ('3411','3412','3413','3414','4411','4412','4413','4414','4422','4324',
   '4421','4423','4424','4311','4312','4313','4314','3422','3421','3423','3424','4322','4321','4323','3311',    '3312','3313','3314','3321','3323','3324','3322','3122','3121','3123','3124','3221','3223','3223','3224')
      THEN 'Second Role'
      
   WHEN Segment IN ('1331','1332','1431','1432','1433','1333','1334','1434','1341','1342',
   '1343','1344','1441','1442','1443','1444','1244','1243','1242','1241')
      THEN 'Churned Best'
      
   WHEN Segment IN ('1113','1114','1213','1214','1123','1124','2113','2114','2213','2214',
   '2123','2124','1223','1224','2224','2224')
      THEN 'Responsives'
   
   WHEN Segment IN ('4141','4142','4143','4144','4131','4132','4133','4134','4241','4242',
   '4243','4244','4231','4232','4233','4234')
      THEN 'High Value New'
      
   WHEN Segment IN ('4111','4112','4113','4114','4121','4122','4123','4124','4211','4212',
   '4213','4214','4221','4222','4223','4224')
      THEN 'Low Value New'
      
   WHEN Segment IN ('1111','1211','1121','1112','2111','2211','1221','1122','1222','2221',
   '2122','2212','2112','2121','2222','1212','1131','1231','1132','1232','1141','1241','1142','1242','1133',    '1131','1132','1134','1141','1142','1143','1144','1321','1322','1323','1324','1311','1312','1313','1314')
      THEN 'Churned'
      
   WHEN Segment IN ('2331','2332','2333','2334','2231','2232','2233','2234','2341','2342',
   '2343','2344','2431','2432','2433','2434','3331','3332','3333','3334','3231','3232','3233','3234','2341',    '3342','3343','3344','3431','3432','3433','3434','2441','2442','2443','2444','4241','4242','4243','4241',    '4422','4423','4424','3241','3242','3243','3244')
      THEN 'Promising'
   
   ELSE 'Other'
     
     END AS RFM_Segment
      FROM RFM")

kbl(head(RFM_Table))%>%
   kable_classic(full_width = F, html_font = "Cambria")
Customer_ID Recency Frequency AOV Coupon_Count Recency_Score Frequency_Score Monetary_Score Cmp_Count_Score Segment RFM_Segment
1283 429 28 1071.5 14 1 1 3 1 1131 Churned
759 380 11 1638.7 22 1 1 4 1 1141 Churned
975 364 17 1183.6 18 1 1 3 1 1131 Churned
1132 346 30 905.8 19 1 1 2 1 1121 Churned
1112 265 72 949.2 21 1 2 2 1 1221 Churned
1261 233 30 1753.3 27 1 1 4 1 1141 Churned

You may find Segment definitions below:

  • Star: These customers are the best and should be treated like champions.
  • High Value: These customers are high valued customers that we don’t want to lose.
  • Promising: These customers are very loyal but not high spenders
  • Second Role: These customers are similar to the Promising segment. They are loyal but they can slip away to Churn segment.
  • High Value New: These customers are the Rookies with high average order value.
  • Low Value New: These customers are the Rookies with low average order value.
  • Churned: These customers are lost.
  • Churned Best: These are old Star or High Value customers that lost.
  • Responsives: These customers are responsive to campaigns, so, we should need triggers to activate.

Below plot shows us the distributions of our segments. We have Churned customers at the top. It can be because of the non-activated customers that we couldn’t place the second order to their baskets. We have middle to low amount of High Value and Star customers which is good because we can’t afford to treat a lot of customers as champions.

p<-RFM_Table%>%select(Customer_ID,RFM_Segment)%>%
   group_by(RFM_Segment)%>%
   summarize(cnt=n())%>%
   ggplot(aes(x=reorder(RFM_Segment, cnt), y=cnt)) +
      geom_bar(stat="identity", fill="#37536d")+
      theme_minimal()+
      coord_flip()+
      labs(title = "RFM Segment Distribution")+
      xlab("Segment")+
      ylab("Count")

ggplotly(p)

Below plots show us the distribution of RFMC metrics based on our RFM segments.

We have higher median monetary in Churned Best, Star and High Value segments. This makes sense but gives us an insight that we should find a way to get back Churned Best customers immediately as they have the highest median for monetary.

As expected, we have zero median recency for New segments and Stars. It is obvious that Stars are purchasing from our stores frequently.

We have higher median frequency for Stars and Second Roles. We would expect that because Second Role customers are loyal but they have a tendency to become a Churned customer. It is possible for them to have higher frequency but they might be low on monetary part.

We have higher Campaign Usage in Star and Responsive segments. This is expected because we created Responsive segment based on the coupon redemption data. These kind of customers need a trigger to purchase but they might not be profitable like other customer segments. Because they always wait for a campaign or discount to purchase. So if we would get costs for items, we could show that the net profit for responsive segment is really low than other segments.

ggarrange(RFM_Table%>%select(RFM_Segment,AOV)%>%
   group_by(RFM_Segment)%>%
   summarize(mdn=median(AOV))%>%
   ggplot(aes(x=reorder(RFM_Segment, mdn), y=mdn)) +
      geom_bar(stat="identity", fill="#37536d")+
      theme_minimal()+
      coord_flip()+ labs(title = "Median Monetary")+
      xlab("Segment")+
      ylab("Median"),

RFM_Table%>%select(RFM_Segment,Recency)%>%
   group_by(RFM_Segment)%>%
   summarize(mdn=median(Recency))%>%
   ggplot(aes(x=reorder(RFM_Segment, mdn), y=mdn)) +
      geom_bar(stat="identity", fill="#37536d")+
      theme_minimal()+
      coord_flip()+ labs(title = "Median Recency")+
      xlab("Segment")+
      ylab("Median"),

RFM_Table%>%select(RFM_Segment,Frequency)%>%
   group_by(RFM_Segment)%>%
   summarize(mdn=median(Frequency))%>%
   ggplot(aes(x=reorder(RFM_Segment, mdn), y=mdn))+
      geom_bar(stat="identity", fill="#37536d")+
      theme_minimal()+
      coord_flip()+ labs(title = "Median Frequency")+
      xlab("Segment")+
      ylab("Median"),

RFM_Table%>%select(RFM_Segment,Coupon_Count)%>%
   group_by(RFM_Segment)%>%
   summarize(mdn=median(Coupon_Count))%>%
   ggplot(aes(x=reorder(RFM_Segment, mdn), y=mdn))+
      geom_bar(stat="identity", fill="#37536d")+
      theme_minimal() +
      coord_flip()+ labs(title = "Median Coupon Count")+
      xlab("Segment")+
      ylab("Median"))

Conclusion

  • Category based analysis was made to understand how does discount change in different categories. It was obvious that categories with higher turnover sold with low discount.
  • Best seller items were analyzed to see how their daily unit price changes. Peaks seem to be in the similar time points.
  • Turnover and Average Discount distributions are analyzed based on their Categories and Brand Types. Distributions were very similar for both Turnover and Discount.
  • We analyzed customer demographics by grouping customers into Age Groups and Marriage Status. Older ages have higher average older value.
  • Campaigns were analyzed by the simple Return Rate KPI. Grocery campaigns were the best campaign with its return rate about 14%.
  • Customer base was segmented into different customer groups. By the help of this segmentation, company can take personalized actions for their campaign and communication processes.